Maximizing Text-Mining Performance

Authors

  • Sholom M. Weiss
  • Chidanand Apte
  • Fred J. Damerau
  • David E. Johnson
  • Frank J. Oles
  • Thilo Goetz
Abstract

With the advent of centralized data warehouses, where data might be stored as electronic documents or as text fields in databases, text mining has increased in importance and economic value. One important goal in text mining is automatic classification of electronic documents. Computer programs scan text in a document and apply a model that assigns the document to one or more prespecified topics. Researchers have used benchmark data, such as the Reuters-21578 test collection, to measure advances in automated text categorization. Conventional methods such as decision trees have had competitive, but not optimal, predictive performance. Using the Reuters collection, we show that adaptive resampling techniques can improve decision-tree performance and that relatively small, pooled local dictionaries are effective. We’ve applied these techniques to online banking applications to enhance automated e-mail routing.
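The abstract does not spell out how the pooled local dictionaries are built, so the following is only a minimal sketch of one plausible reading: take the most frequent words from each topic's documents (a "local" dictionary) and pool them into a single feature set. The function names, the whitespace tokenizer, the frequency-based selection, and the toy documents are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def local_dictionary(docs, k):
    """Return the k most frequent tokens across the documents of one topic.

    Assumption: plain lowercase whitespace tokenization and raw frequency
    ranking; the paper may use a different selection rule.
    """
    counts = Counter(token for doc in docs for token in doc.lower().split())
    return [word for word, _ in counts.most_common(k)]


def pooled_dictionary(docs_by_topic, k):
    """Pool (union) the per-topic local dictionaries into one sorted word list."""
    pooled = set()
    for docs in docs_by_topic.values():
        pooled.update(local_dictionary(docs, k))
    return sorted(pooled)


def to_features(doc, dictionary):
    """Encode one document as a binary word-presence vector over the dictionary."""
    tokens = set(doc.lower().split())
    return [1 if word in tokens else 0 for word in dictionary]


# Hypothetical toy corpus: two topics with two short documents each.
docs_by_topic = {
    "earnings": ["profit rose profit", "quarterly profit fell"],
    "grain": ["wheat exports wheat", "wheat harvest"],
}
dictionary = pooled_dictionary(docs_by_topic, k=1)
vector = to_features("wheat prices rose", dictionary)
```

Vectors like `vector` could then be passed to any classifier, for example decision trees trained with adaptive resampling as the abstract describes; that training step is not sketched here.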


Similar resources

Cut-off Grade Optimization for Maximizing the Output Rate

In open-pit mining, one of the first decisions that must be made in the production planning stage, after completing the design of the final pit limits, is determining the processing plant cut-off grade. Since this grade has an essential effect on operations, choosing the optimum cut-off grade is of considerable importance. Different goals may be used for determining the optimum cut-off grade. One of...


Unsupervised Phrasal Near-Synonym Generation from Text Corpora

Unsupervised discovery of synonymous phrases is useful in a variety of tasks ranging from text mining and search engines to semantic analysis and machine translation. This paper presents an unsupervised corpus-based conditional model: Near-Synonym System (NeSS) for finding phrasal synonyms and near synonyms that requires only a large monolingual corpus. The method is based on maximizing informa...


Conditional Information Bottleneck Clustering

We present an extension of the well-known information bottleneck framework, called conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score. This general approach can be utilized in a data mining context to extract relevant information that is at the same time novel relative to known properties or structures...


Presenting a model for extracting information from text documents, based on text mining in the field of e-learning

As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...


A review of text mining approaches and their function in discovering and extracting a topic

Background and aim: Four text mining methods are examined, with a focus on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature on text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...



Publication date: 1999